ABSTRACT

We propose a new method to visualize gene expression experiments inspired by
latent semantic indexing, a technique originally proposed in the context of textual
analysis. By using the correspondence word-gene and document-experiment,
we define an asymmetric similarity measure of association for genes that
accounts for potential hierarchies in the data, the key to obtaining meaningful gene
mappings. We use the polar decomposition to obtain the sources of asymmetry
of the similarity matrix, which are later combined with prior knowledge.
Genetic classes of genes are identified by means of a mixture model applied in
the latent space of the genes. We describe the steps of the procedure and show
its utility on the Human Cancer dataset.

Keywords: Latent semantic indexing, Asymmetric similarities, Gene expression
data, Textual data analysis.



1 Introduction


A gene expression dataset consists of a matrix $Y \in \mathbb{R}^{n \times p}$, with each row representing an
experiment and each column representing a gene. Typically, the number of genes is several
thousand, whereas the number of experiments or samples is in the order of tens. In Figure
1A we show the heat map of the differentially expressed genes of the Human Cancer dataset,
which originally consists of 6830 genes measured in 64 experiments corresponding to 14
different types of cancer, available in Hastie et al. (2009). Answering questions such as
which genes are more similar in terms of their expression profiles, or which genes
are involved in certain types of cancer, is the key to extracting useful biological knowledge from
experiments of this type.

A common strategy to find interesting patterns in the data is to define some measure
of similarity or dissimilarity for the genes (Priness et al., 2007; Kim et al., 2007), which
is later combined with a clustering algorithm (Kohonen et al., 2001; Gat-Viks et al., 2003).
The Euclidean distance, the Pearson correlation coefficient and the Mutual Information are
the most common measures. Although useful in many scenarios, such measures are unable
to capture some complex features that have been discovered in the way the
genes interact with each other. A particularly interesting case is the hierarchy among the
genes, a universal pattern that has been extensively observed in the literature, mainly in
the context of network analysis (Reka and Barabasi, 2002; Wuchty et al., 2003; Barabasi
and Oltvai, 2004).


Inspired by latent semantic indexing (LSI) (Deerwester, 1988; Deerwester et al., 1990),
a technique originally proposed for textual data analysis, in this paper we propose a new
visualization technique to unravel the structure of gene expression datasets. Although the
idea of using textual data analysis techniques in the biological context has been explored
in some recent works (Bicego et al., 2010; Ng et al., 2004; Caldas et al.,
2009), these approaches use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) as the
fundamental model, which neither provides a Euclidean representation of the genes useful
for visualization nor takes into account the hierarchical relationships among the genes. In
this work we address both problems by means of a new asymmetric latent semantic indexing
approach (aLSI), following the existing literature on methods based on asymmetric similarities
(Okada and Imaizumi, 1987; Okada, 1990; Chino, 1978, 1990; Munoz et al., 2003). The
contributions of this paper are twofold:

(i) A proof-of-concept analysis to illustrate the importance of using asymmetric gene
similarities in gene expression experiments.

(ii) A new asymmetric latent semantic indexing (aLSI) approach to produce meaningful
gene mappings, which can be used in combination with prior biological knowledge
such as gene ontologies, pathways, protein-protein interaction networks, etc.


Our approach is inspired by the work of Munoz and Gonzalez (2012), in which an asymmetric
version of the LSI is defined in the textual data context. In that work the
authors propose a partition of the data into several hierarchical levels, which aims at accounting
for the hierarchical relationships between the words of the database. Within each level, a
Gram Mercer kernel matrix is obtained by means of the triangular decomposition, which
captures the remaining asymmetries not removed by the partition into layers.
Finally, a Euclidean representation of the words is produced within each level and these are
connected using a measure of inclusion.

In this work, we propose an alternative aLSI which does not require a partition of the
dataset into hierarchical levels. This is itself an advantage with respect to the work
of Munoz and Gonzalez (2012), since the choice of the number of layers is avoided.
Nevertheless, the key aspect of our approach is to replace the triangular decomposition
of the similarity matrix by the polar decomposition, which produces two complementary
gene representations. This allows us to produce a global mapping that does not require
any partition of the data, while the information provided by the asymmetries in the gene
similarity matrix is still taken into account.

This paper is organized as follows. In Section 2 we detail the connection between asymmetric
similarities and hierarchies in genetic experiments and we illustrate this phenomenon
in the Human Cancer data set. In Section 3 we propose a new asymmetric latent semantic
indexing (aLSI) procedure. In Section 4 we illustrate the utility of the proposed approach
in a real data experiment and in Section 5 we conclude with a discussion of this work.


2 Hierarchy/asymmetry in gene expression experiments 


In this section we illustrate the idea of “gene hierarchy”. To this end, we will use the
Human Cancer data set mentioned above. Consider the matrix $X$ such that $x_{jk} = 1$ if the gene
$j$ is significantly expressed in the experiment $k$ and $x_{jk} = 0$ otherwise (see Section 4.1 for
details). This gene-experiment matrix is analogous to the term-document matrix, common
in textual analysis (Munoz and Gonzalez, 2012). In this field, it is common to work with a
matrix $X$ where $x_{jk} = 1$ if the term $j$ appears in document $k$ and $x_{jk} = 0$ otherwise. By
using the correspondence genes/words and experiments/documents we can apply techniques
from the text mining literature to analyse gene expression datasets. Therefore, in the sequel
we will use the terms genes-words and experiments-documents interchangeably.

Figure 1: A) Heat map of the micro-array of the Human Cancer dataset. Originally, there
are 6830 genes (columns) whose expression is measured in 64 patients (rows) with 14 different
types of cancer. Colour intensity represents the expression level of the genes. B) Heat map
of the Human Cancer dataset in which only the expressed genes are highlighted (in white).
Each row of this matrix can be interpreted as a document whose words are those genes which
are differentially expressed.

For now, consider a textual data set and let $|x_i|$ be the number of documents indexed by
the $i$-th term and $|x_i \wedge x_j|$ the number of documents indexed by both terms $i$ and $j$. Consider
the following asymmetric similarity measure ($s_{ij} \neq s_{ji}$)

$$ s_{ij} = \frac{|x_i \wedge x_j|}{|x_i|} = \frac{\sum_k \min(x_{ik}, x_{jk})}{\sum_k x_{ik}}, \qquad (2.1) $$

which has been previously studied in a number of works related to Information Retrieval
(Munoz, 1997; Martin de Diego et al., 2010). It turns out that expression (2.1) can be
interpreted as the degree to which the topic represented by the term $i$ is a subset of the topic
represented by the term $j$. As a measure of inclusion it was originally proposed by Kosko
(1991) in the context of fuzzy set theory. Regarding its interpretation in a textual data
example, consider, for instance, a collection of documents containing the term “statistics”.
In this case a more specific term like “non parametric” will occur in just a subset of them. The
relation between “non parametric” and “statistics” is strongly asymmetric, in the sense that
the concept represented by the word “non parametric” is a subset of the concept represented
by the word “statistics” but not conversely. In the biological context, where $s_{ij}$ represents
the similarity between two genes, expression (2.1) represents the degree to which a gene $i$ is
a subclass of, or is hierarchically dependent on, a gene $j$.

Figure 2: Evidence of Zipf’s law in gene expression experiments: histogram of the norms
of the 2093 differentially expressed genes of the Human Cancer data set.
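
As a minimal sketch of eq. (2.1), the following NumPy snippet computes the inclusion similarity for a binary matrix. It assumes that rows index terms (genes) and columns index documents (experiments), as the indices in eq. (2.1) suggest; the function name `inclusion_similarity` and the toy data are illustrative and not taken from the authors' code.

```python
import numpy as np

def inclusion_similarity(X):
    """Asymmetric similarity of eq. (2.1) for a binary terms-by-documents array X.

    S[i, j] = sum_k min(X[i, k], X[j, k]) / sum_k X[i, k].
    """
    X = np.asarray(X, dtype=float)
    # For 0/1 entries, min(x_ik, x_jk) equals the product x_ik * x_jk,
    # so the co-occurrence counts are the entries of X X^T.
    co_occurrence = X @ X.T
    norms = X.sum(axis=1)
    # Terms that never occur get similarity 0 instead of a division by zero.
    safe = np.where(norms > 0, norms, 1.0)
    S = co_occurrence / safe[:, None]
    S[norms == 0] = 0.0
    return S

# Toy corpus: two terms (rows) over five documents (columns).
X_toy = np.array([
    [1, 1, 1, 1, 0],   # a general term, e.g. "statistics"
    [0, 1, 1, 0, 0],   # a specific term, e.g. "non parametric"
])
S_toy = inclusion_similarity(X_toy)
```

In this toy case the specific term is fully included in the general one (`S_toy[1, 0]` is 1), while the converse inclusion is only partial (`S_toy[0, 1]` is 0.5), which is the asymmetry discussed in the text.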

The matrix $X$ contains information about both the terms and the documents of the
database. In the sequel we will use $t_j$ to refer to the terms (columns of $X$) and $d_i$ (rows of
$X$) to refer to the documents. Using the definition of similarity in expression (2.1) the skew-
Therefore, a large difference between $s_{ij}$ and $s_{ji}$ is directly related to a large difference
between the norms of the words given by $|t_i|$ and $|t_j|$. Thus, the distribution of term norms
in case of asymmetry/hierarchy is clearly far from being uniform.
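
The link between asymmetry and norms can be made explicit with a short derivation that uses only eq. (2.1) and the symmetry of the co-occurrence count, $|x_i \wedge x_j| = |x_j \wedge x_i|$. This is a worked step added for illustration, not an equation reproduced from the original text:

```latex
s_{ij} - s_{ji}
  = \frac{|x_i \wedge x_j|}{|x_i|} - \frac{|x_i \wedge x_j|}{|x_j|}
  = |x_i \wedge x_j| \, \frac{|x_j| - |x_i|}{|x_i|\,|x_j|}.
```

The difference vanishes when the two terms have the same norm and grows with the gap between $|x_i|$ and $|x_j|$, provided the terms co-occur at all.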

In Figure 2 we show the histogram of the norms of the differentially expressed genes
of the Human Cancer data set. The figure shows that a small number of genes have very
large norms while a large number of genes have small norms. This behaviour, which can
be modelled by means of Zipf’s law (Martin-Merino and Munoz, 2005), is evidence
of asymmetric/hierarchical associations. Genes with large norms correspond to ‘biologically
relevant’ genes involved in many processes (or high level concepts), whereas genes with small
norms represent rarely expressed genes (or very specific concepts). The hierarchy induced on
the gene set by the inclusion measure $s_{ij}$ is directly related to its asymmetric nature, and is
caused by the strongly asymmetric gene frequency distribution.


3 Asymmetric latent semantic indexing 


Latent semantic indexing (LSI) (Deerwester, 1988) is a useful technique in natural language
processing to analyse relationships between a set of documents and the terms they
contain. The idea is to produce a set of concepts or latent semantic classes to summarize the
content of the dataset. In this section we propose an asymmetric latent semantic indexing
that uses as input the similarity in eq. (2.1). In a biological context, we will talk about
‘latent genetic classes’ to refer to groups of genes that summarize the main content of the
data. Next, we introduce the LSI and later generalize it to its asymmetric version.


3.1 Latent semantic indexing 

Consider the $n \times p$ document-by-term matrix $X$ whose entries contain the word counts per
document. The matrix $X^T X$ contains the correlations among terms $t_j$ and $t_k$ (measured
as $t_j^T t_k$) and $XX^T$ contains the correlations among documents, measured as $d_i^T d_j$. Using
the singular value decomposition (SVD) of $X$ we obtain the decomposition $X =
U_x \Sigma_x V_x^T$, where $U_x$ and $V_x$ are orthogonal matrices and $\Sigma_x$ is diagonal and contains the
singular values of $X$. It is straightforward to see that

$$ XX^T = U_x \Sigma_x \Sigma_x^T U_x^T \quad \text{and} \quad X^T X = V_x \Sigma_x^T \Sigma_x V_x^T. \qquad (3.1) $$

Therefore, the immersion of the term $t_j$ into the semantic class space is given by

$$ \hat{t}_j = \Sigma_x^{-1} U_x^T t_j. \qquad (3.2) $$

On the other hand, the immersion of document $d_i$ in the same latent space is given by

$$ \hat{d}_i = \Sigma_x^{-1} V_x^T d_i. $$
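
The classical LSI construction can be sketched numerically as follows. The count matrix is a hypothetical toy example (4 documents by 3 terms, full column rank so that all singular values are nonzero), and the variable names are illustrative. The snippet builds the two Gram matrices of eq. (3.1) from the SVD factors and computes the term and document immersions.

```python
import numpy as np

# Hypothetical document-by-term count matrix: 4 documents (rows), 3 terms (columns).
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Thin SVD: X = U diag(s) V^T, with U of size 4 x 3 and V of size 3 x 3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Gram matrices of eq. (3.1), rebuilt from the SVD factors.
XXt = U @ np.diag(s**2) @ U.T      # document-document products
XtX = Vt.T @ np.diag(s**2) @ Vt    # term-term products

# Immersions into the latent space.
# Term j is column j of X; row j of terms_latent is Sigma^{-1} U^T t_j.
terms_latent = (np.diag(1.0 / s) @ U.T @ X).T
# Document i is row i of X; row i of docs_latent is Sigma^{-1} V^T d_i.
docs_latent = (np.diag(1.0 / s) @ Vt @ X.T).T
```

With these definitions the term immersions coincide with the rows of $V_x$ and the document immersions with the rows of $U_x$, which is why the SVD factors are often used directly as latent coordinates.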


3.2 Polar decomposition of an asymmetric similarity matrix 


Consider the $p \times p$ asymmetric similarity matrix $(S)_{ij} = s_{ij}$ in eq. (2.1). By means of the
SVD we obtain that $S = U_s \Sigma_s V_s^T$, which leads to the polar decomposition of $S$ (Horn and
Johnson, 1991; Higham, 1986). Then $S = K_1 L = L K_2$, where $L = U_s V_s^T$ and

$$ K_1 = U_s \Sigma_s U_s^T, \qquad (3.3) $$

$$ K_2 = V_s \Sigma_s V_s^T. \qquad (3.4) $$


Note that $\|S\|_F = \|K_1\|_F = \|K_2\|_F$, where $\|\cdot\|_F$ is the Frobenius norm. Also note that
$S$ does not directly decompose into any combination of $K_1$ and $K_2$, but these matrices can be
understood as the two sources of asymmetry of $S$. Geometrically speaking, since $SV = U\Sigma$,
it is straightforward to check that $S v_j = \sigma_j u_j$, where $v_j$ and $u_j$ are the columns of $V$
and $U$ respectively. Therefore the eigenvectors $\{v_1, \ldots, v_p\}$ of $K_2$ are mapped under the
asymmetric matrix $S$ onto the scaled orthogonal coordinate system $\{\sigma_1 u_1, \ldots, \sigma_p u_p\}$.
Equivalently, one can interpret the symmetric effect with respect to the eigenvectors $\{u_1, \ldots, u_p\}$.
The asymmetry in $S$ is therefore reflected in the angle between each pair of left and right
singular vectors of $S$. Therefore span$\{v_1, \ldots, v_p\}$ and span$\{u_1, \ldots, u_p\}$ produce different but
complementary representations of the genes. Note that if $S$ is a symmetric matrix, $K_1 = K_2$
and therefore both representations are equivalent, since $u_j = v_j$ for all $j = 1, \ldots, p$. The
polar decomposition has been previously used in the analysis of asymmetric relationships in
Gower (1977, 1998).
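
The identities of this section are easy to check numerically. The sketch below builds $K_1$, $K_2$ and the orthogonal factor $L = U_s V_s^T$ from the SVD of a small asymmetric matrix; the matrix entries and the function name `polar_sources` are hypothetical and chosen only for illustration.

```python
import numpy as np

def polar_sources(S):
    """Return (K1, K2, L) with S = K1 @ L = L @ K2, built from the SVD of S."""
    U, sigma, Vt = np.linalg.svd(S)
    K1 = U @ np.diag(sigma) @ U.T      # eq. (3.3): left symmetric factor
    K2 = Vt.T @ np.diag(sigma) @ Vt    # eq. (3.4): right symmetric factor
    L = U @ Vt                         # orthogonal factor
    return K1, K2, L

# Hypothetical asymmetric similarity matrix for three terms.
S = np.array([
    [1.0, 0.5, 0.2],
    [1.0, 1.0, 0.0],
    [0.4, 0.0, 1.0],
])
K1, K2, L = polar_sources(S)

# Frobenius norms of S, K1 and K2 should agree up to round-off.
norms = [np.linalg.norm(M, "fro") for M in (S, K1, K2)]
```

Both `K1 @ L` and `L @ K2` reproduce `S`, and the three Frobenius norms agree, since all three matrices share the singular values $\sigma_j$.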


3.3 Merging the sources of asymmetry 

The matrices $K_1$ and $K_2$ are symmetric and positive semi-definite. Therefore, they are
kernel matrices (Aronszajn, 1950; Wahba, 1990) that admit the decompositions $K_1 = \Phi_1 \Phi_1^T$
and $K_2 = \Phi_2 \Phi_2^T$, where $\Phi_1 = U\Sigma^{1/2}$ and $\Phi_2 = V\Sigma^{1/2}$ respectively. The two matrices induce
two different distances for the terms, which are a consequence of $S$ being asymmetric.
Note that if $S$ is symmetric then $\Phi_1 = \Phi_2$. Finding a unifying distance (or kernel) using
$K_1$ and $K_2$ is therefore the key to obtaining an appropriate Euclidean representation for the
terms. In this sense, suppose that we are able to find suitable transformations $\phi_i$, $i = 1, 2$,
such that the induced distance on the terms, given by $d_{\phi_i}(t_j, t_k)^2 = \|\phi_i(t_j) - \phi_i(t_k)\|^2$,
corresponds to the one induced by each kernel matrix $K_i$. This implies that $d_{\phi_i}(t_j, t_k)^2 =
(K_i)_{jj} + (K_i)_{kk} - 2(K_i)_{jk}$, where $j, k = 1, \ldots, p$ and $(K_i)_{jk} = \phi_i(t_j)^T \phi_i(t_k)$.

Following Gonzalez and Munoz (2013), it is possible to prove that for each matrix $K_i$
there exists a symmetric, continuous and positive-definite kernel function $k_i : T \times T \to \mathbb{R}$,
where $T$ is a compact set, such that $k_i(t_j, t_k) = \phi_i(t_j)^T \phi_i(t_k)$, $t_j, t_k \in T$, is the implicit kernel
corresponding to $d_{\phi_i}(t_j, t_k)$. See Gonzalez and Munoz (2013) for conditions on the existence
of such $k_i$. Each kernel function $k_i$ has a unique associated reproducing kernel Hilbert space
(RKHS), whose feature map (see footnote 1), or canonical basis, is given by $\phi_i$ (Aronszajn, 1950; Wahba,
1990).

The operation of adding the kernels $k_1$ and $k_2$ gives rise to a new RKHS whose feature
map is the union of $\phi_1$ and $\phi_2$. In particular, let $k_1$ and $k_2$ be two positive semi-definite kernel
functions and let $\phi_1$ and $\phi_2$ be their underlying feature maps. Then $k = \lambda_1 k_1 + \lambda_2 k_2$, with
$\lambda_1, \lambda_2 > 0$, is a positive semi-definite kernel with $\phi = [\sqrt{\lambda_1}\phi_1, \sqrt{\lambda_2}\phi_2]$ as a valid feature
map. This property, which can be easily generalized to multiple kernels, implies that
the sum of the kernel functions $k_1$ and $k_2$ can be understood as the sum of the associated
RKHSs. Therefore, the operation $K = \lambda_1 K_1 + \lambda_2 K_2$, with $\lambda_1 = \lambda_2 = 1/2$, has
the property of defining a new kernel matrix whose induced distances take equally into
account the representation of the terms using both kernels, or equivalently in our case, the
representations of the genes given by $\Phi_1 = U\Sigma^{1/2}$ and $\Phi_2 = V\Sigma^{1/2}$. That is, the right and
left singular vectors of $S$ have the same weight in the final distance induced by $K$.
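
The feature-map argument above can be illustrated with a small numerical sketch. The two kernel matrices below are hypothetical positive semi-definite matrices (not the polar factors of a real similarity matrix), and the helper names `feature_map` and `squared_distances` are illustrative. The snippet concatenates the scaled feature maps and checks that they reproduce the combined kernel and its induced distances.

```python
import numpy as np

def feature_map(K):
    """Return Phi with Phi @ Phi.T equal to K, for a symmetric PSD matrix K."""
    w, Q = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)          # remove tiny negative round-off
    return Q * np.sqrt(w)              # scales column j of Q by sqrt(w[j])

def squared_distances(K):
    """Squared distances induced by a kernel matrix: K_jj + K_kk - 2 K_jk."""
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K

# Two hypothetical PSD kernel matrices built from an arbitrary matrix A.
A = np.array([
    [1.0, 0.5, 0.2],
    [1.0, 1.0, 0.0],
    [0.4, 0.0, 1.0],
])
K1 = A @ A.T
K2 = A.T @ A

lam1 = lam2 = 0.5
K = lam1 * K1 + lam2 * K2

# Concatenated feature map [sqrt(lam1) * Phi1, sqrt(lam2) * Phi2].
Phi = np.hstack([np.sqrt(lam1) * feature_map(K1),
                 np.sqrt(lam2) * feature_map(K2)])

# Pairwise squared Euclidean distances between the rows of Phi.
diff = Phi[:, None, :] - Phi[None, :, :]
D_phi = np.sum(diff**2, axis=2)
```

The distances between rows of `Phi` equal those induced by `K`, which in turn are the equally weighted average of the distances induced by `K1` and `K2`.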


An alternative fusion scheme can be found in Munoz and Gonzalez (2012). However, in
that work the main step to deal with asymmetry is to split the dataset into layers of words
with similar norm. Here, we are able to deal with asymmetry in a single step by means
of the polar decomposition of $S$. In the former approach, hierarchical clusters of words are
provided, but a unique representation of the terms, such as the one we provide here, is not available. This
represents a problem for the generalization and applicability of the work in Munoz and
Gonzalez (2012) that is solved in our proposal: since the distance among words of different
layers is not available, that technique cannot be used in problems such as classification, in which
a unique distance for the words is needed.

1 We say that $\phi$ is the feature map of a kernel $k : T \times T \to \mathbb{R}$ if $k(t, t') = \langle \phi(t), \phi(t') \rangle$ holds for any $t, t' \in T$,
where $\langle \cdot, \cdot \rangle$ represents the usual $\ell_2$ inner product.

3.4 Generalizing the combination approach


The goal of this section is to generalize the idea described in the previous section
in order to propose an approach to combine $K_1$, $K_2$ and a third matrix $W$ with prior
information about the problem. Such a matrix might be derived from an initial labelling
of the terms or the experiments. In the genetic context, this is a natural idea since prior
knowledge about the relationships among the genes is common (Wang et al., 2013). Some
examples are gene ontologies, pathways, protein-protein interaction networks, etc. Note that
by imposing $K$ to be positive semi-definite, a Euclidean representation of the terms is always
available by means of some matrix decomposition $K = \Phi\Phi^T$ (Schoenberg, 1935; Young and
Householder, 1938).


We combine $K_1$, $K_2$ and $W$ to obtain a fusion similarity matrix $K$ by minimizing

$$ G_\tau(K) = \|K - \gamma_1 \mathcal{F}(K_1, K_2)\|_F^2 + \tau \|K - \gamma_2 W\|_F^2, \qquad (3.5) $$

where $\tau > 0$ is the regularization parameter, $\gamma_1, \gamma_2 > 0$ are scale parameters and $\mathcal{F}(K_1, K_2)$
is a functional combination of the matrices $K_1$ and $K_2$ whose output is a symmetric positive
semi-definite matrix. The underlying idea in eq. (3.6) is to merge both sources of asymmetry
and to keep a balance with the prior knowledge given by $W$. The fusion scheme proposed in
eq. (3.6) can be derived using a regularization theory approach, similar to the one used in
the derivation of SVM classifiers (Martin de Diego et al., 2010). The solution to the problem
stated in eq. (3.5) is given in the following proposition.


Proposition 1. The minimizer of $G_\tau(K)$ for any $\mathcal{F}$ and $\tau > 0$ and $\gamma_1 = \gamma_2 = \tau + 1$ is given
by

$$ K = \mathcal{F}(K_1, K_2) + \tau W. \qquad (3.6) $$
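
Proposition 1 can be checked numerically. The sketch below uses hypothetical random positive semi-definite matrices in place of $\mathcal{F}(K_1, K_2)$ and $W$, and the function name `fusion_objective` is illustrative. It evaluates the gradient of the objective at the closed-form solution and compares the objective with its value at randomly perturbed matrices.

```python
import numpy as np

def fusion_objective(K, F, W, tau, gamma1, gamma2):
    """Objective of eq. (3.5) for a candidate matrix K."""
    return (np.linalg.norm(K - gamma1 * F, "fro") ** 2
            + tau * np.linalg.norm(K - gamma2 * W, "fro") ** 2)

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
C = rng.normal(size=(4, 4))
F = B @ B.T            # hypothetical stand-in for F(K1, K2), symmetric PSD
W = C @ C.T            # hypothetical prior-knowledge matrix, symmetric PSD

tau = 0.2
gamma = tau + 1.0      # gamma_1 = gamma_2 = tau + 1, as in Proposition 1

# Closed-form solution of eq. (3.6).
K_star = F + tau * W

# Entry-wise gradient of the objective, evaluated at K_star.
gradient = 2.0 * (K_star - gamma * F) + 2.0 * tau * (K_star - gamma * W)

# Objective at the solution and at five randomly perturbed matrices.
best = fusion_objective(K_star, F, W, tau, gamma, gamma)
perturbed = [
    fusion_objective(K_star + 0.1 * rng.normal(size=(4, 4)), F, W, tau, gamma, gamma)
    for _ in range(5)
]
```

The gradient vanishes at `K_star` and every perturbed matrix yields a larger objective, as expected for a strictly convex quadratic function of the entries of $K$.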

Of course, different $\mathcal{F}$ lead to different combinations of $K_1$ and $K_2$. In this work, and
based on the ideas described in the previous section, we consider the arithmetic mean of the
matrices, $\mathcal{F}(K_1, K_2) = (K_1 + K_2)/2$, but we refer to (Martin de Diego et al., 2010; Munoz
and Gonzalez, 2007; Munoz and Gonzalez, 2008; Munoz et al., 2006) for further kernel fusion
procedures.


3.5 Probabilistic latent semantic indexing with asymmetric similarities


In this section we make use of eq. (3.5), computed from the asymmetric similarity matrix
$S$, to redefine the LSI. We use the ideas from (Park and Ramamohanarao, 2009; Munoz
and Gonzalez, 2012) with the special novelty that the term representation is given by the
distances induced by our particular choice of $K$.


Following the ideas described in Section 3.3, let $\phi$ be a transformation of the terms such
that the induced distance on the terms, given by $d_\phi(t_j, t_k)^2 = \|\phi(t_j) - \phi(t_k)\|^2$, corresponds
to the one induced by the kernel matrix $K$. Consider the matrix $\Phi$ whose rows, say
the $\phi(t_j)$, represent the transformation of $t_j$ to the latent class/feature
space. Following the LSI scheme, we apply the SVD to the transformed term $p \times m$ matrix
$\Phi = U\Sigma V^T$ and we obtain that $K = \Phi\Phi^T = U\Sigma\Sigma^T U^T = (U\Lambda^{1/2})(U\Lambda^{1/2})^T$, where $\Lambda = \Sigma\Sigma^T =
\Sigma^2$ is the diagonal matrix of eigenvalues of $K$ and $\Sigma$ is the diagonal matrix of singular values
of $\Phi$. In this context the matrix $K$ plays the role of $X^T X$ in the original LSI formulation.
Then the immersion of $\phi(t_j)$ is given by

$$ \phi_s(t_j) = \Sigma^{-1} U^T \phi(t_j) = \Lambda^{-1/2} U^T \phi(t_j). $$

Therefore, by replacing $X^T X$ by $K$ we ‘kernelize’ the LSI by using the original asymmetric
similarity matrix $S$: we replace the original linear mapping of the LSI by the non-linear one
given by $\phi$.
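
A minimal sketch of how a Euclidean representation can be recovered from a kernel matrix is given below. It uses the eigendecomposition $K = U \Lambda U^T$ and takes the rows of $U\Lambda^{1/2}$ as coordinates, optionally truncated to the $m$ leading components. The kernel matrix is hypothetical and the function name `kernel_embedding` is illustrative; this is not the authors' implementation.

```python
import numpy as np

def kernel_embedding(K, m):
    """Coordinates given by the m leading columns of U Lambda^{1/2}, where K = U Lambda U^T."""
    w, U = np.linalg.eigh(K)                 # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:m]            # indices of the m largest eigenvalues
    w_top = np.clip(w[top], 0.0, None)       # guard against negative round-off
    return U[:, top] * np.sqrt(w_top)

# Hypothetical symmetric PSD kernel matrix for four terms.
B = np.array([
    [1.0, 0.2, 0.0],
    [0.9, 0.1, 0.3],
    [0.0, 1.0, 0.5],
    [0.1, 0.8, 0.7],
])
K = B @ B.T

full = kernel_embedding(K, m=4)     # all components: reproduces K exactly
reduced = kernel_embedding(K, m=2)  # two leading components, e.g. for plotting
```

Keeping all components gives `full @ full.T` equal to `K`; the truncated coordinates in `reduced` are the kind of low dimensional map on which clusters of terms can be inspected.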

The semantic classes in the latent space can be identified with clusters of transformed
term data. In order to estimate such semantic classes $c_1, \ldots, c_q$ we apply Gaussian mixture
model-based clustering (Fraley and Raftery, 2002). That is, for each term we obtain an
estimate of the probability of membership, $p(c_i | t_j)$, of each one of the latent semantic
classes $c_i$. We assume that each cluster is generated by a multivariate Gaussian distribution
$f_k(t) = \mathcal{N}_k(\mu_k, \Sigma_k)$, where $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix respectively.
The final mixture density is therefore given by

$$ f(t) = \sum_{k=1}^{q} \alpha_k \mathcal{N}_k(t) = \sum_{k=1}^{q} \alpha_k \mathcal{N}(\mu_k, \Sigma_k), $$

where each $\alpha_k$ represents the prior probability or weight of the component $k$. The main
advantage of this approach is that we can obtain a density estimator for each cluster and a
‘soft’ classification rule is available: each term may belong to more than one semantic class
via the use of the conditional probabilities $p(c_i | t_j)$.
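
The soft classification rule can be sketched as follows. The snippet only covers the final step, computing the membership probabilities by Bayes' rule for given mixture parameters; estimating the weights, means and covariances (for example by EM, as in model-based clustering) is not shown. All parameter values and function names are hypothetical.

```python
import numpy as np

def gaussian_density(T, mean, cov):
    """Multivariate normal density evaluated at each row of T."""
    d = T.shape[1]
    diff = T - mean
    solved = np.linalg.solve(cov, diff.T).T          # cov^{-1} (t - mean) for each row
    mahalanobis = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (mahalanobis + logdet + d * np.log(2.0 * np.pi)))

def membership_probabilities(T, weights, means, covs):
    """Row j holds the posterior probabilities of each class given point T[j]."""
    weighted = np.column_stack([
        a * gaussian_density(T, mu, cov)
        for a, mu, cov in zip(weights, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical two-component mixture in a two-dimensional latent space.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]

# Three latent points: one at each mean and one halfway between them.
T = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
P = membership_probabilities(T, weights, means, covs)
hard_labels = P.argmax(axis=1)
```

The point halfway between the two means receives equal membership in both classes, which is the kind of shared membership that a hard clustering rule would hide.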


3.6 Algorithm 

In this section we summarize the steps to apply the proposed asymmetric latent semantic
indexing to a data set. As we detailed in Section 2, there exist strong similarities between
textual and gene expression data; therefore, our proposal can be used in both scenarios. See
Table 1 for details.

Input: Genes-by-experiments matrix $X$.

Output: Map of terms (genes), latent semantic classes.

1. Obtain the asymmetric similarity $S$.

2. Decompose $S = U_s \Sigma_s V_s^T$.

3. Obtain the two sources of asymmetry $K_1$ and $K_2$.

4. Obtain the matrix of labels of the terms (or genes) $W$.

5. Fuse the matrices using the scheme proposed in (3.6).

6. Obtain the projections of the terms into the latent semantic classes.

7. Assign probabilities to the classes using a mixture model.

8. Visualize the genes and the mixture model using MDS.

Table 1: Main steps of aLSI algorithm.
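
Steps 1 to 5 of Table 1 can be chained in a few lines, as in the sketch below. It assumes a binary genes-by-experiments array, the arithmetic mean as the combination $\mathcal{F}$, and a user-supplied matrix $W$; the function name `alsi_kernel`, the toy data and the toy label matrix are hypothetical. Steps 6 to 8 (projection, mixture model and MDS) are not covered.

```python
import numpy as np

def alsi_kernel(X, W, tau=0.2):
    """Fused kernel of eq. (3.6) for a binary genes-by-experiments array X."""
    X = np.asarray(X, dtype=float)
    norms = X.sum(axis=1)
    if np.any(norms == 0):
        raise ValueError("every gene must be expressed in at least one experiment")
    # Step 1: asymmetric similarity of eq. (2.1); for binary data min() is a product.
    S = (X @ X.T) / norms[:, None]
    # Step 2: singular value decomposition of S.
    U, sigma, Vt = np.linalg.svd(S)
    # Step 3: the two sources of asymmetry, eqs. (3.3) and (3.4).
    K1 = U @ np.diag(sigma) @ U.T
    K2 = Vt.T @ np.diag(sigma) @ Vt
    # Steps 4 and 5: W is supplied by the caller and fused as in eq. (3.6).
    return 0.5 * (K1 + K2) + tau * W

# Hypothetical toy data: four genes measured in five experiments.
X_toy = np.array([
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
])
# Hypothetical prior knowledge: memberships of the genes in two classes.
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_toy = labels @ labels.T
K_toy = alsi_kernel(X_toy, W_toy)
```

Because $K_1$, $K_2$ and this particular `W_toy` are all positive semi-definite, so is `K_toy`, and a Euclidean representation of the genes can be extracted from it.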

4 Application: aLSI of the Human cancer data set 

In this section we analyse the Human Cancer data set, described in the introduction of this
work, by using the proposed asymmetric latent semantic indexing detailed in Section 3.5.
The analysis consists of two main steps. First, we identify the genes which are significantly
expressed in each experiment and we obtain the matrix $X$. Second, we use this matrix to
obtain genetic semantic classes of genes that we will associate with different types of cancer.
In order to find the clusters of genes, we also use the Euclidean distance and the correlation
matrix to illustrate the benefits of our approach in this context. The R code to replicate all
the figures and results of this work is available at https://github.com/javiergonzalezh/aLSI.


4.1 Differential analysis 

The initial point in our analysis is the matrix Y, which consists of the expression level of 
6830 genes in 64 experiments. The first step is to identify which genes are differentially 
expressed. That is, to statistically decide whether for a given gene its expression is greater 
than what we would expect just due to natural random variations. 

The motivation for this gene filtering is that relatively few genes of the
database should be expressed in each experiment. Different methods have been proposed
in the literature (Yang et al., 2013). In this work, we follow a simple and straightforward
approach which uses the coefficient of variation $CV = |\bar{x}|/\mathrm{sd}(x)$ to discriminate between
expressed and non-expressed genes. The reason to use this coefficient is the linear relationship
between the mean and the standard deviation of the expression of the genes
across the experiments; see Figure 3A. In particular, we consider that a gene is differentially
expressed in the database if the value of the coefficient of variation is larger than
0.5. Of course, other thresholds are possible if additional information about the experimental
noise is available. In Figure 3B, we show the histogram of the coefficients of variation of
all the genes of the database. The total number of genes with a CV larger than 0.5 is 2093.

Figure 3: The first step towards the identification of the latent genetic classes of the database
is to perform a differential analysis of the genes. A) Mean vs. standard deviation of the
6830 genes of the Human Cancer data set across the 64 available patients. B) Histogram of
the CV of all the genes.

Given the set of expressed genes, in order to build the matrix $X$, we need to decide when
a particular gene is expressed in an experiment. To this end, we consider the maximum of
the expression in the set of non-expressed genes and we use it as a threshold in the set of the
expressed ones. The purpose of this threshold is to capture the random variation in the data.
Figure 4 shows the expression values of two genes across the 64 experiments. One of the genes
(left) is differentially expressed in those experiments above the selected threshold (horizontal
dotted line at 4.46). In particular, this gene is assumed to be significantly expressed in a
total of 5 experiments. On the other hand, in Figure 4 (right), we show the expression
values of a non-expressed gene. All the values remain below the threshold, reflecting that
the variations in expression are random variations. In Figure 1B, we show the heat map of
the 2093 differentially expressed genes of the dataset.
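
The two filtering steps of this section can be sketched as below. The snippet takes the coefficient exactly as it is written in the text (absolute mean over standard deviation), uses the sample standard deviation, and assumes that at least one gene falls below the CV threshold; these choices, the function name `binarize_expression` and the toy data are assumptions made for illustration and are not taken from the authors' R code.

```python
import numpy as np

def binarize_expression(Y, cv_threshold=0.5):
    """Build a binary genes-by-experiments array from an experiments-by-genes array Y.

    Returns (X, expressed, level): the binary array, the mask of selected genes,
    and the expression threshold taken from the non-selected genes.
    """
    Y = np.asarray(Y, dtype=float)
    # Coefficient as written in the text: |mean| / sd, computed per gene (column).
    cv = np.abs(Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    expressed = cv > cv_threshold
    # Threshold: maximum expression observed among the non-selected genes.
    level = Y[:, ~expressed].max()
    # A selected gene is marked as expressed in an experiment if it exceeds the threshold.
    X = (Y[:, expressed] > level).astype(int).T
    return X, expressed, level

# Hypothetical toy data: four experiments (rows) and three genes (columns).
Y_toy = np.array([
    [5.0,  1.0, 0.2],
    [5.0, -1.0, 3.0],
    [5.5,  1.0, 0.1],
    [4.5, -1.0, 0.5],
])
X_toy, expressed_toy, level_toy = binarize_expression(Y_toy)
```

In this toy example the second gene has zero mean and is filtered out; its largest value, 1.0, becomes the threshold used to binarize the two remaining genes.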
Figure 4: Illustration of the profiles of two genes. On the left (panel “Differentially
expressed gene”) we show a gene that is differentially expressed in 5 experiments. On the
right (panel “Non differentially expressed gene”) we show the profile of a non differentially
expressed gene.


4.2 Extraction of latent genetic classes using aLSI 


Next, we apply the asymmetric latent semantic indexing proposed in Section 3.5 to the
differentially expressed genes of the Human Cancer dataset. To this end, we calculate the
gene similarity following (2.1) and we proceed with the steps of the algorithm in Table 1.


The matrix $W$ in expression (3.6) is calculated using the labels of the experiments. First
we assign a membership of the genes to each one of the 14 types of cancer: “CNS”, “RENAL”,
“BREAST”, “NSCLC”, “UNKNOWN”, “OVARIAN”, “MELANOMA”, “PROSTATE”, “LEUKEMIA”,
“K562B-repro”, “K562A-repro”, “COLON”, “MCF7A-repro”, and “MCF7D-repro”. To this
end, we assign the gene $i$ to the type of cancer $k$ if it is expressed in at least one of the
experiments of that type. Note that the same gene might belong to more than one class
simultaneously. We define the gene similarity matrix $Q$ whose entries are calculated as

$$ q_{ij} = \frac{\#\,\text{times genes } i \text{ and } j \text{ appear simultaneously in some type of cancer}}{\#\,\text{types of cancer in which gene } i \text{ is expressed}}. \qquad (4.1) $$
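
Eq. (4.1) has the same form as eq. (2.1), with cancer types playing the role of documents. The sketch below reads the numerator as the number of cancer types to which both genes are assigned, which is an interpretation of the wording above; the function name `cancer_type_similarity` and the membership matrix are hypothetical.

```python
import numpy as np

def cancer_type_similarity(M):
    """Matrix Q of eq. (4.1) for a binary genes-by-cancer-types membership array M.

    Assumes every gene is assigned to at least one cancer type.
    """
    M = np.asarray(M, dtype=float)
    shared = M @ M.T                   # number of cancer types shared by genes i and j
    n_types = M.sum(axis=1)            # number of cancer types of gene i
    return shared / n_types[:, None]

# Hypothetical memberships of three genes in three cancer types.
M_toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])
Q_toy = cancer_type_similarity(M_toy)
```

As with the expression-based similarity, `Q_toy` is asymmetric: the second gene is fully included in the first one, but not the other way round.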


The matrix $W$ in (3.6) is calculated as $W = (Q_1 + Q_2)/2$, where $Q_1$ and $Q_2$ are the matrices
resulting from the polar decomposition of $Q$. Note that the matrix $W$ plays the role of the
labels in the combination, following the idea of kernel combinations in the support vector
classification context (Martin de Diego et al., 2010). The parameter $\tau$ is fixed to 0.2 following
Gonzalez and Munoz (2013).

We apply the aLSI described in Section 3.5. We use metric multidimensional scaling (MDS)
to obtain a low dimensional representation of the genes, which is shown in Figure 5. The
projections obtained using the Pearson correlation and the Euclidean distance are also shown. The
Euclidean distance and the Pearson correlation do not show any cluster structure helpful to
identify groups of genes involved in different cancers. However, the proposed aLSI is able to
do so.


In order to interpret such groups we estimate the mixture model described in Section 3.5
with 14 groups. Each gene is assigned to a cluster by taking

$$ \text{class of gene}_i = \arg\max_{c_k} \, p(c_k | \text{gene}_i). $$

The conditional probabilities $p(c_k | \text{gene}_i)$ can be interpreted in this context as fuzzy membership
degrees. In Table 2 we show the 10 genes with the highest probability for each cluster.
In Table 3 we show the cross frequencies of the genes in the different types of cancer and
clusters. Note that the same gene might belong to different cancer groups simultaneously;
therefore, the correspondence between clusters and cancer types is not necessarily one to one.

Some interesting conclusions emerge when Table 3 is interpreted. BREAST, COLON,
MELANOMA, NSCLC and RENAL cancers seem to be associated with single clusters. The
cancers K562A-repro and K562B-repro appear clearly together in the same group (group
9), which also occurs with cancers MCF7A-repro and MCF7D-repro. Apart from the
interpretability of the groups in terms of types of cancer, Table 3 also helps to identify similarities
between types of cancer. Similar patterns between cancers across the clusters (similar rows)
can be associated with similar types of cancer. The previously mentioned case of the
K562A-repro and K562B-repro types is a clear example. A graphic illustration of these results can
be observed in Figure 6, which shows a Sammon mapping of the 14 latent genetic classes
(types of cancer) using the results from Table 3.


5 Conclusions 

In this paper we have proposed a new approach to visualize gene expression experiments.
The key idea is to use an asymmetric similarity for the genes, within the
latent semantic indexing context, to obtain latent genetic classes or groups of genes which
are similar in their expression patterns. We provide both a Euclidean representation of the
genes, which is able to illustrate the different genetic patterns of expression in the data set,
and the probabilities of membership of each gene in those classes. The proposed method has
been used to analyse the Human Cancer dataset, obtaining new and valuable information
that remains unnoticed by classical similarity measures like Pearson’s correlation and
the Euclidean distance.

Figure 5: Multidimensional scaling projections (1st, 2nd, 3rd) using the similarity
matrix produced by the aLSI, the Pearson correlation and the Euclidean distance. The
colouring of the groups corresponds to the membership of the genes in the different groups of
cancer: G1 (BREAST), G2 (CNS), G3 (COLON), G4 (K562A-repro), G5 (K562B-repro),
G6 (LEUKEMIA), G7 (MCF7A-repro), G8 (MCF7D-repro), G9 (MELANOMA), G10
(NSCLC), G11 (OVARIAN), G12 (PROSTATE), G13 (RENAL), G14 (UNKNOWN).

Figure 6: Sammon mapping of the 14 types of cancer using the results from Table 3.

This work leads to a wide variety of future analyses. On the more theoretical and
methodological side, the study of the geometrical properties of the matrices $K_1$ and $K_2$ and of further
combination procedures is of interest. For instance, we aim to explore the geometric and
harmonic weighted means given by

$$ \mathcal{F}_{geometric}(K_1, K_2) = K_1^{1/2} \left( K_1^{-1/2} K_2 K_1^{-1/2} \right)^{t} K_1^{1/2}, $$

$$ \mathcal{F}_{harmonic}(K_1, K_2) = \left( t K_1^{-1} + (1 - t) K_2^{-1} \right)^{-1}, $$

for $t \in [0, 1]$ and to study their effects on the final gene representation.
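
The two weighted means can be sketched with eigendecompositions, as below. The snippet assumes strictly positive definite inputs, since both formulas require matrix inverses; positive semi-definite factors with zero singular values would need some regularization, which is not shown. The function names and the two toy matrices are hypothetical.

```python
import numpy as np

def spd_power(K, exponent):
    """Matrix power of a symmetric positive definite matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(K)
    return (Q * w**exponent) @ Q.T

def geometric_mean(K1, K2, t):
    """Weighted geometric mean K1^{1/2} (K1^{-1/2} K2 K1^{-1/2})^t K1^{1/2}."""
    root = spd_power(K1, 0.5)
    inv_root = spd_power(K1, -0.5)
    inner = inv_root @ K2 @ inv_root
    inner = 0.5 * (inner + inner.T)    # symmetrize against round-off before eigh
    return root @ spd_power(inner, t) @ root

def harmonic_mean(K1, K2, t):
    """Weighted harmonic mean (t K1^{-1} + (1 - t) K2^{-1})^{-1}."""
    return np.linalg.inv(t * np.linalg.inv(K1) + (1.0 - t) * np.linalg.inv(K2))

# Two hypothetical symmetric positive definite matrices.
K1_toy = np.array([[2.0, 0.5], [0.5, 1.0]])
K2_toy = np.array([[1.0, 0.2], [0.2, 3.0]])

G_half = geometric_mean(K1_toy, K2_toy, 0.5)
H_half = harmonic_mean(K1_toy, K2_toy, 0.5)
```

With the formulas as written, the geometric mean moves from $K_1$ at $t = 0$ to $K_2$ at $t = 1$, whereas the harmonic mean returns $K_2$ at $t = 0$ and $K_1$ at $t = 1$.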

In addition, although we have presented a method in which the sources of asymmetry
of the gene similarity are merged into a symmetric matrix, we plan to investigate the
potential combinations of our approach with previously developed asymmetric multidimensional
scaling techniques (Chino, 2012). New ways to embed prior knowledge into the
matrix $W$ will also be the focus of further study, which we envision will have a large impact for
practitioners: in this work we have only considered the labelling of the experiments to obtain a
measure of association for the genes. In the future, however, we aim to consider gene
ontologies and other topological measures of biological networks, such as protein-protein
interaction networks, to improve the final gene mapping and the interpretation of the obtained
gene semantic classes.


A Appendix 


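Proof. (Proposition 1). To find the optimum of $G_\tau(K)$ we take the partial derivative with respect to each $S_{ls}$. Then 

$$\frac{\partial G_\tau}{\partial S_{ls}}(K) = 2\left( K_{ls} - \gamma_1 \mathcal{F}(K_1, K_2)_{ls} \right) + 2\tau \left( K_{ls} - \gamma_2 W_{ls} \right) \qquad (A.1)$$

for $s, l = 1, \ldots, m$. Setting the previous partial derivatives to zero yields a linear system 
whose unique solution is a matrix $K^*$ whose elements are given by 

$$K^* = \gamma_1 \frac{1}{\tau + 1} \mathcal{F}(K_1, K_2) + \gamma_2 \frac{\tau}{\tau + 1} W = \mathcal{F}(K_1, K_2) + \tau W, \qquad (A.2)$$

for $l, s = 1, \ldots, m$. To check whether $K^*$ is a maximum or a minimum we evaluate the Hessian 
matrix of $G_\tau[S]$ at $K^*$. This matrix is the $n \times n$ diagonal matrix 

$$H(K^*) = 2 \begin{pmatrix} \tau + 1 & 0 & \cdots & 0 \\ 0 & \tau + 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tau + 1 \end{pmatrix} \qquad (A.3)$$

which is positive definite for any $\tau > 0$. Hence, (A.2) is a minimum of (3.5) for any $\tau > 0$. □ 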
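
The argument above can be checked numerically with a short sketch. It assumes that the objective (3.5) has the quadratic form $G_\tau(S) = \|S - \gamma_1 \mathcal{F}(K_1,K_2)\|_F^2 + \tau \|S - \gamma_2 W\|_F^2$, which is the form consistent with the gradient (A.1) but is not restated in this appendix. The matrices `F` and `W` are random stand-ins, and the function names are hypothetical.

```python
import numpy as np


def objective(S, F, W, tau, gamma1, gamma2):
    """Assumed form of G_tau: squared distances of S to gamma1*F and gamma2*W."""
    return np.sum((S - gamma1 * F) ** 2) + tau * np.sum((S - gamma2 * W) ** 2)


def gradient(S, F, W, tau, gamma1, gamma2):
    """Element-wise partial derivatives of the assumed objective, as in (A.1)."""
    return 2.0 * (S - gamma1 * F) + 2.0 * tau * (S - gamma2 * W)


def minimizer(F, W, tau, gamma1, gamma2):
    """Closed-form stationary point, as in (A.2)."""
    return (gamma1 * F + tau * gamma2 * W) / (tau + 1.0)


rng = np.random.default_rng(1)
m, tau = 5, 0.7
F = rng.normal(size=(m, m))  # stand-in for the combination of K1 and K2
W = rng.normal(size=(m, m))  # stand-in for the prior-knowledge matrix
gamma1 = gamma2 = tau + 1.0  # choice under which (A.2) reduces to F + tau * W

K_star = minimizer(F, W, tau, gamma1, gamma2)
grad_norm = np.linalg.norm(gradient(K_star, F, W, tau, gamma1, gamma2))
```

Under these assumptions the gradient vanishes at `K_star`, and any perturbation of `K_star` increases the objective, in agreement with the positive definite Hessian in (A.3).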


Proposition 2. Let $k_1$ and $k_2$ be two positive semi-definite kernel functions and let $\phi_1$ and 
$\phi_2$ be their underlying feature maps. Then $k = \lambda_1 k_1 + \lambda_2 k_2$, with $\lambda_1, \lambda_2 > 0$, is a positive 
semi-definite kernel with $\phi = [\sqrt{\lambda_1}\phi_1, \sqrt{\lambda_2}\phi_2]$ as a valid feature map. 


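Proof. (Proposition 2). We only need to show that $k(t,t') = \langle \phi(t), \phi(t') \rangle$ is satisfied for $k$ 
and $\phi$. In our case we have that 

$$\begin{aligned}
\langle \phi(t), \phi(t') \rangle &= \left\langle \left( \sqrt{\lambda_1}\phi_1(t), \sqrt{\lambda_2}\phi_2(t) \right), \left( \sqrt{\lambda_1}\phi_1(t'), \sqrt{\lambda_2}\phi_2(t') \right) \right\rangle \\
&= \lambda_1 \langle \phi_1(t), \phi_1(t') \rangle + \lambda_2 \langle \phi_2(t), \phi_2(t') \rangle \\
&= \lambda_1 k_1(t,t') + \lambda_2 k_2(t,t') \\
&= k(t,t'),
\end{aligned}$$

which shows that the proposition holds. □ 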
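
Proposition 2 can be illustrated with a small numerical sketch. The choice of a linear kernel for $k_1$ and a homogeneous quadratic kernel for $k_2$, together with their explicit feature maps, is an assumption made only for this example; the weights and sample points are arbitrary.

```python
import numpy as np


def phi1(t):
    """Feature map of the linear kernel k1(t, s) = <t, s>."""
    return np.asarray(t, dtype=float)


def phi2(t):
    """Feature map of the quadratic kernel k2(t, s) = <t, s>^2."""
    t = np.asarray(t, dtype=float)
    return np.outer(t, t).ravel()


def k1(t, s):
    return float(np.dot(t, s))


def k2(t, s):
    return float(np.dot(t, s)) ** 2


def combined_kernel(t, s, lam1, lam2):
    """k = lam1 * k1 + lam2 * k2."""
    return lam1 * k1(t, s) + lam2 * k2(t, s)


def combined_feature_map(t, lam1, lam2):
    """phi = [sqrt(lam1) * phi1, sqrt(lam2) * phi2], as in Proposition 2."""
    return np.concatenate([np.sqrt(lam1) * phi1(t), np.sqrt(lam2) * phi2(t)])


rng = np.random.default_rng(2)
points = rng.normal(size=(6, 3))  # arbitrary sample points
lam1, lam2 = 0.3, 1.7             # arbitrary positive weights

gram_kernel = np.array(
    [[combined_kernel(t, s, lam1, lam2) for s in points] for t in points]
)
features = np.array([combined_feature_map(t, lam1, lam2) for t in points])
gram_features = features @ features.T
```

The Gram matrix computed from the combined kernel matches the one computed from inner products of the concatenated feature vectors, which is the identity established in the proof.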









Acknowledgments We acknowledge the support of the Spanish Grant Nos. MEC-2007/04438/00 
and DGULM-2008/00059/00. We also thank Georges E. Janssens for his helpful comments 
on the manuscript. 


References 

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American 
Mathematical Society, 68(3):337-404. 

Barabasi, A.-L. and Oltvai, Z. N. (2004). Network biology: understanding the cell’s 
functional organization. Nature Reviews Genetics, 5(2):101-113. 

Bicego, M., Lovato, P., Oliboni, B., and Perina, A. (2010). Expression microarray 
classification using topic models. In Proceedings of the 2010 ACM Symposium on Applied 
Computing, SAC ’10, pages 1516-1520, New York, NY, USA. ACM. 

Blei, D. M., Ng, A. Y., Jordan, M. I., and Lafferty, J. (2003). Latent Dirichlet allocation. 
Journal of Machine Learning Research, 3:993-1022. 

Caldas, J., Gehlenborg, N., Faisal, A., Brazma, A., and Kaski, S. (2009). Probabilistic 
retrieval and visualization of biologically relevant microarray experiments. 

Chino, N. (1978). A graphical technique for representing the asymmetric relationships 
between n objects. Behaviormetrika, 5(23-40):59. 

Chino, N. (1990). A generalized inner product model for the analysis of asymmetry. 
Behaviormetrika, 27:25-46. 

Chino, N. (2012). A brief survey of asymmetric MDS and some open problems. 
Behaviormetrika, 39:127-165. 

Deerwester, S. (1988). Improving Information Retrieval with Latent Semantic Indexing. In 
Borgman, C. L. and Pai, E. Y. H., editors, Proceedings of the 51st ASIS Annual Meeting 
(ASIS ’88), volume 25, Atlanta, Georgia. American Society for Information Science. 

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). 
Indexing by latent semantic analysis. Journal of the American Society for Information 
Science, 41(6):391-407. 

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and 
density estimation. Journal of the American Statistical Association, 97:611-631. 

Gat-Viks, I., Sharan, R., and Shamir, R. (2003). Scoring clustering solutions by their 
biological relevance. Bioinformatics, 19(18):2381-2389. 

Gonzalez, J. and Munoz, A. (2013). Functional analysis techniques to improve similarity 
matrices in discrimination problems. Journal of Multivariate Analysis, 120(C):120-134. 

Gower, J. (1977). The analysis of asymmetry and orthogonality. In: Recent Developments 
in Statistics. Eds. J. Barra et al. Amsterdam: North Holland Press, pages 109-123. 

Gower, J. (1998). Orthogonality and its approximation in the analysis of asymmetry. Linear 
algebra and its applications, 278:183-193. 

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: 
Data mining, inference, and prediction, second edition. Springer Series in Statistics. 

Higham, N. J. (1986). Computing the polar decomposition with applications. SIAM J. Sci. 
Statist. Comput., 7:1160-1174. 

Horn, R. A. and Johnson, C. R. (1991). Topics in matrix analysis. Cambridge University Press. 

Kim, K., Zhang, S., Jiang, K., Cai, L., Lee, I.-B., Feldman, L. J., and Huang, H. (2007). 
Measuring similarities between gene expression profiles through new data transformations. 
BMC Bioinformatics, 8:29. 

Kohonen, T., Schroeder, M. R., and Huang, T. S., editors (2001). Self-Organizing Maps. 
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 3rd edition. 

Kosko, B. (1991). Neural networks and fuzzy systems: A dynamical approach to machine 
intelligence. Prentice Hall. 

Martin de Diego, I., Munoz, A., and Martinez Moguerza, J. (2010). Methods for the 
combination of kernel matrices within a support vector framework. Machine Learning, 78:137-174. 

Martin-Merino, M. and Munoz, A. (2005). Visualizing asymmetric proximities with SOM and 
MDS models. Neurocomputing, 63:171-192. 

Munoz, A. (1997). Compound key word generation from document databases using a 
hierarchical clustering ART model. Intelligent Data Analysis, 1(1-4):25-48. 

Munoz, A. and Gonzalez, J. (2007). Joint diagonalization of kernels for information fusion. 
In Proceedings of the Congress on Pattern Recognition 12th Iberoamerican Conference 
on Progress in Pattern Recognition, Image Analysis and Applications, CIARP’07, pages 
556-563, Berlin, Heidelberg. Springer-Verlag. 

Munoz, A. and Gonzalez, J. (2008). Functional learning of kernels for information fusion 
purposes. In Ruiz-Shulcloper, J. and Kropatsch, W. G., editors, CIARP, volume 5197 of 
Lecture Notes in Computer Science, pages 277-283. Springer. 

Munoz, A. and Gonzalez, J. (2012). Hierarchical latent semantic class extraction using 
asymmetric term similarities. Behaviormetrika, 39(1):91-109. 

Munoz, A., Gonzalez, J., and de Diego, I. M. (2006). Local linear approximation for kernel 
methods: The railway kernel. In Trinidad, J. F. M., Carrasco-Ochoa, J. A., and Kittler, 
J., editors, CIARP, volume 4225 of Lecture Notes in Computer Science, pages 936-944. 
Springer. 

Munoz, A., de Diego, I. M., and Moguerza, J. M. (2003). Support vector machine 
classifiers for asymmetric proximities. In Artificial Neural Networks and Neural Information 
Processing - ICANN/ICONIP 2003, pages 217-224. Springer. 

Ng, S.-K., Zhu, Z., and Ong, Y.-S. (2004). Whole-genome functional classification of genes 
by latent semantic analysis on microarray data. In Proceedings of the second conference 
on Asia-Pacific bioinformatics - Volume 29, APBC ’04, pages 123-129, Darlinghurst, 
Australia, Australia. Australian Computer Society, Inc. 

Okada, A. (1990). A generalization of asymmetric multidimensional scaling. In Knowledge, 
data and computer-assisted decisions, pages 127-138. Springer. 

Okada, A. and Imaizumi, T. (1987). Nonmetric multidimensional scaling of asymmetric 
proximities. Behaviormetrika, 21:81-96. 

Park, L. A. F. and Ramamohanarao, K. (2009). Kernel latent semantic analysis using an 
information retrieval based kernel. In CIKM, pages 1721-1724. 

Priness, I., Maimon, O., and Ben-Gal, I. E. (2007). Evaluation of gene-expression clustering 
via mutual information distance measure. BMC Bioinformatics, 8. 

Reka, A. and Barabasi (2002). Statistical mechanics of complex networks. Rev. Mod. Phys., 
74:47-97. 

Schoenberg, I. J. (1935). Remarks to Maurice Fréchet’s article “Sur la définition axiomatique 
d’une classe d’espaces distanciés vectoriellement applicable sur l’espace de Hilbert”. Annals of 
Mathematics, 36(3). 

Wahba, G. (1990). Spline models for observational data. Series in Applied Mathematics, 
SIAM, Philadelphia, 59. 

Wang, Z., Xu, W., San Lucas, F. A., and Liu, Y. (2013). Incorporating prior knowledge into 
gene network study. Bioinformatics. 

Wuchty, S., Ravasz, E., and Barabasi, A.-L. (2003). The architecture of biological networks. 

Yang, E.-W., Girke, T., and Jiang, T. (2013). Differential gene expression analysis using 
coexpression and RNA-Seq data. Bioinformatics, 29(17):2153-2161. 

Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their 
mutual distances. Psychometrika, 3:19-22. 



Latent genetic class   Gene 1   Gene 2   Gene 3   Gene 4   Gene 5
                   1        7        8      619      683     1726
                   2     1891      193      187      188      186
                   3     1721     1720     1684     1620     1653
                   4       19       59       63       76      102
                   5     1339     1619     2040     1596     1470
                   6     1359      574     1451     1729     2007
                   7      130      156      157      158      249
                   8      496      502      515      493      494
                   9      242      338      369      377      380
                  10      253      277      278      279      281
                  11      996     1000     1045      992     1007
                  12      449      451      475      485      486
                  13      480      576      577      578     1168
                  14      828      883      884      893      913

Table 2: IDs of the 5 genes with maximum probability in the mixture model for each of the 14 
latent genetic clusters. The label of each gene is given by its row position in the dataset of 
differentially expressed genes. 

               C1   C2   C3   C4   C5   C6   C7   C8   C9  C10  C11  C12  C13  C14
BREAST          0    1    0   30    0  301   24    1   11   41    8    2   14   29
CNS             5   30   38   11   57   67    0   11    7    0   12    2    2    1
COLON           0   15    5  287    0   17    0    2    5    6    0    4    7    0
K562A-repro     1    1    0    0    0    0    0    0   62    0    0    2    0    0
K562B-repro     0    2    0    1    0    0    0    0   64         0    0    1    0
LEUKEMIA        0   21   13   33    0   48    1  147   62    4    1   56    2    0
MCF7A-repro     0    0    0    1    0    1    0    0    1   45    0    0    0    0
MCF7D-repro     0    2    0    1    0    5    0    0    0   36    0    0    1    0
MELANOMA        1   24   39   21    0   64    0    8    6    5  307    0    8    1
NSCLC           0  259    5   25    0   80    0    1    3    3    0    1   43    0
OVARIAN        86   36   29   42    2   36    0   12    7    3   12    1    5    0
PROSTATE        4   11    4    3    0   11    0    0    5    0    2    0    1    0
RENAL           0   47  377   17    0   87    0    6    6    1    0    0    1    0
UNKNOWN         5    7    2    2    0    5    0    0    0    0    0    0    0    0

Table 3: Correspondence between the 14 estimated latent genetic clusters and the membership 
of the genes in the different types of cancer. 